The National Assessment of Educational Progress has the charge
of garnering and reporting educational achievement data that are:

- an accurate representation of absolute performance now:
e.g., 82% of the nation's nine-year-olds can multiply
3 x 0 (NAEP, January 1975);

- a precise representation of performance now, relative to
performance three to seven years ago: e.g., in 1973, 47%
of 17-year-olds knew the purpose of an electrical transformer.
This is a 13% decline since 1969 (NAEP, May 1975).

The baseline measure of absolute performance must be reliable,
though not necessarily in the accepted psychometric sense. It may
be better to say that the measures must be "accurate." A measure
may be reliable without being accurate: a scale that consistently
adds ten pounds to one's weight may be perfectly reliable.



For the relative change measures, reliable biases are unimportant,
since the difference between two biased measures is the same
as the difference between two unbiased measures, if the bias is
simply a constant that cancels out. However, the bias in a measure
may also change over time. Suppose, for example, one saw an increase
of 2%¹ in 13-year-olds' performance on a certain reading task.
Suppose, however, that there was also a 6% decline in the non-response
rate: suppose, that is, that 6% more respondents were guessing
rather than leaving the item blank. Even if children could guess
no better than chance, one would expect 1.5% more children, on a
four-alternative item, to get the correct answer because of that
change in response patterns. This would change a statistically
stable 2% improvement to a non-significant .5% improvement.
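
The arithmetic of this example can be sketched in a few lines of
Python. This is only an illustration of the reasoning in the
paragraph above, assuming pure chance guessing with equal
probability on every alternative; the function and variable names
are hypothetical and are not part of any NAEP procedure.

```python
def expected_guess_gain(shift_to_guessing_pct: float, n_alternatives: int) -> float:
    """Percentage points gained by chance alone when a share of
    respondents moves from leaving the item blank to guessing."""
    return shift_to_guessing_pct / n_alternatives


# Figures taken from the example in the text.
observed_change = 2.0      # observed rise in percent correct
shift_to_guessing = 6.0    # decline in the non-response rate
n_alternatives = 4         # four-alternative item

guess_gain = expected_guess_gain(shift_to_guessing, n_alternatives)
adjusted_change = observed_change - guess_gain

print(guess_gain)        # 1.5
print(adjusted_change)   # 0.5
```

Under these assumptions the observed 2% gain shrinks to .5% once
the change in guessing behavior is taken out.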

National Assessment has always used an "I don't know" foil on
all cognitive multiple-choice items to discourage respondents from
guessing. Guessing not only inflates the estimation at one point
in time of the percent of respondents who can do the task, but also
a change in guessing behavior (as illustrated above) can affect the
interpretation of a change in percent of success over time.
Unfortunately, the pattern of response to an "I don't know" (IDK) foil



¹A 2% increase, since there are 3.6 million 13-year-olds in the
nation, means that 72,000 more children can perform the reading
task.



can also differ for different groups. Sherman (1974) found that
some southeasterners, females, blacks, and rural persons use the
IDK poorly. Thus the "I don't know" can contribute to bias in a
measure at one point in time or change measures over time, as much
as a differential tendency to omit items.

Other potential sources of bias over time include changes in
procedural matters, such as training or experience of test
administrators, or school cooperation, or type of print used in packages
(test booklets), or the voice which reads the exercises on tape.

One of the most serious potential sources of bias arises because
National Assessment releases some exercises for publication and
does not reuse them. The remaining unreleased exercises are then
repackaged and reassessed for change. Since they have been
repackaged, they are presented for the second time in different
orders, different contexts, and different positions. Any of these
variables may affect performance and thus either mask or exaggerate
the change in performance over time.

Thus, National Assessment measures of change as well as baseline
performance must be extremely accurate. From the first, National
Assessment has devoted great resources to precise sampling design.
In the last few years, it has also begun to devote resources to
locating sources of non-sampling error. Many of these non-sampling
errors have been dismissed as unimportant on conventional tests.
Conventional tests are collections of items, the sum of which is
taken to measure a rather globally-defined trait (such as "intelligence"
or "arithmetic achievement"), and then only relative to some norm
group. Inaccuracies due to individual items and the examinees'
response to them can, to some extent, be supposed to average out in
the total score. Because of NAEP's item-by-item reporting, these
errors once again become important.

Objectives.

The purpose of the present study was to determine the effect
of two sources of non-sampling error: position in package (beginning,
middle or end of the assessment instrument) and exercise format
(multiple-choice with an "I don't know" alternative, multiple choice
without IDK, and open-ended). Obviously, position in package is a
source of error that one cannot eliminate, since some exercise or
other always must be first, middle or last in a package. It is a
source of error that can be held constant over time, however, if it
is found to be important. Further study of the IDK foil may show
that it should be dropped (in future item development: it cannot
be dropped from change exercises even though bias is strongly
suspected); replaced with corrections for guessing; retained; or
retained, but supplemented with corrections for guessing. The
present study can provide some data to answer these questions;
however, it must be emphasized that the present data were all
collected at one time and so questions about change analyses cannot
be fully answered.
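
To make "corrections for guessing" concrete, the following Python
sketch applies the classical formula-scoring correction, in which
the share answering wrongly, divided by the number of alternatives
minus one, is subtracted from the share answering correctly. This
is the textbook correction only; the paper does not say which
version NAEP staff considered, and the percentages used here are
hypothetical, not study data.

```python
def corrected_percent_correct(pct_right: float, pct_wrong: float,
                              n_alternatives: int) -> float:
    """Classical correction for guessing applied to group percentages.

    Assumes every wrong answer comes from a blind guess among
    n_alternatives equally attractive choices. IDK responses and
    omissions are neither right nor wrong, so they do not enter.
    """
    if n_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    return pct_right - pct_wrong / (n_alternatives - 1)


# Hypothetical four-alternative item: 50% right, 30% wrong,
# 20% "I don't know" or no response.
print(corrected_percent_correct(50.0, 30.0, 4))   # 40.0
```

The correction leans entirely on the blind-guessing assumption, which
is one reason it is treated in the text as an open question rather
than as a settled replacement for the IDK foil.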



Methods and Data Source

The present study was included in the 1973-74 assessment of
Writing and Career and Occupational Development. It is, therefore,
based on national probability samples of 2,500 9-year-olds,
13-year-olds or 17-year-olds for each item. At each age, nine different
packages were involved, and thus nine different samples of 2,500
respondents. Each package was a block in the 3³ balanced incomplete
blocks design used at each age. The three factors in the design
were:

exercise content - three different science questions were 
developed, such that exactly the same stem was used for both 
multiple-choice and open-ended formats; 

format - multiple choice with IDK, multiple choice without IDK,
and open-ended;

position in package - beginning, middle, end. 




Each of the nine packages contained three exercises which
represented each content, each format and each position. For
example, package #1 at age 9 contained:



Beginning               Middle                  End

Exercise about blood    Exercise about          Exercise about
circulation,            largest living          lightning and
Multiple-choice         animal,                 thunder,
without IDK             Open-ended              Multiple-choice
                                                with IDK
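
A design with these properties can be generated mechanically. The
Python sketch below splits the 27 content x format x position
combinations into nine packages of three by grouping cells on
(content + format) mod 3 and (content + position) mod 3. This is
one standard construction, offered as an assumption: the numeric
coding of the levels and the resulting package order are
hypothetical, so the packages it prints need not match the actual
NAEP packages.

```python
from itertools import product

CONTENTS = ("blood circulation", "largest living animal", "lightning and thunder")
FORMATS = ("multiple-choice with IDK", "multiple-choice without IDK", "open-ended")
POSITIONS = ("beginning", "middle", "end")


def build_packages():
    """Split the 27 cells of the 3 x 3 x 3 design into 9 blocks of 3.

    Cells that share (e + f) mod 3 and (e + p) mod 3 go into the same
    block, so every block holds each content, each format and each
    position exactly once.
    """
    packages = {}
    for e, f, p in product(range(3), repeat=3):
        block = ((e + f) % 3, (e + p) % 3)
        packages.setdefault(block, []).append(
            (CONTENTS[e], FORMATS[f], POSITIONS[p])
        )
    return packages


packages = build_packages()
for block, cells in sorted(packages.items()):
    print(block)
    for content, fmt, position in cells:
        print("   ", position, "|", content, "|", fmt)
```

Because every package contains every level of every factor once,
main effects are estimated free of package differences, while some
interaction components vary only between packages.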



See attachments 1, 2, and 3 for the wording of the exercises in
the multiple-choice and IDK format. Attachments 1, 2, and 3 also
give the national percents for each foil (including IDK and no
response) for each exercise, format and position.

Results.

The design was set up so that the analysis of variance estimates
for the main effects were unconfounded with blocks, but all
interactions were partially confounded.² To get some independent estimate
of these block effects, a marker exercise was placed at the end of
each of the nine packages. This marker exercise allowed an empirical
estimate of the sampling variability; it also contained variation
due to the accumulated effect of differing contexts of presentation,
since the nine packages all contained different Writing and COD
exercises. This marker exercise contained five parts (five different
questions about reading a map). The variance component due to
parts within blocks (that is, the natural variation in the difficulty
of the five questions) was at least 50 times greater than the
component due to the block effect. Thus the block effect, though



²Components of the interactions were calculated by the modular
arithmetic method described in Winer (1971, p. 606ff).



in some cases statistically significant because of large sample
sizes, was very small compared to the normal variation among
exercises. There is a second reason for disregarding possible
block effects. Inspection of the analysis of variance tables
(attachments 4, 5 and 6) shows that the mean squares for confounded
interaction parts were about the same size as the mean squares for
the unconfounded interaction parts. Both of these pieces of evidence
indicate that the main within blocks analysis can be interpreted
straightforwardly.

At all ages, there was a large main effect for exercise
content, which is simply to say that some questions were harder
than others. There was also a main effect for format. Only at
age 9 was there a significant position effect. It did not appear
to be a fatigue effect, which might be expected with these young
children, but rather a disadvantage in performance to the
beginning-of-the-package exercises. It should be noted that these beginning
exercises were never first in package, but simply occurred within
the first five minutes of testing. Again at all ages there were
content by format interactions, which can basically be interpreted
as proving that some tasks are more difficult than others in the
open-ended format.

The significant format effect deserves further discussion.
Exhibit 1 displays the mean percent correct (averaged over the
three positions) for each exercise in each format at each age.



Exhibit 1. Means and Standard Deviations (in parentheses)
for Three Different Formats of Exercises

                            Multiple Choice
        Exercise      without IDK     with IDK        Open-Ended

Age 9   1             76.9  (1.80)    68.9  (0.76)    52.6  (0.67)
        2             27.5  (2.07)    26.4  (4.74)     5.5  (0.75)
        3             65.2  (1.61)    62.3  (1.37)    22.7  (2.15)
        average*      56.51 (1.83)    52.52 (2.88)    26.91 (1.87)

Age 13  1             47.4  (1.29)    41.0  (1.13)     5.8  (0.61)
        2             59.1  (1.55)    58.6  (1.53)    32.9  (3.50)
        3             23.8  (0.53)    22.9  (1.66)    10.7  (0.65)
        average*      43.42 (1.20)    40.86 (1.45)    16.44 (2.08)

Age 17  1             53.5  (0.67)    44.8  (1.15)    13.5  (1.00)
        2             72.7  (2.73)    71.3  (0.49)    48.9  (1.2?)
        3             17.7  (2.22)    14.6  (1.06)    11.0  (0.?4)
        average*      47.96 (2.06)    43.57 (0.94)    24.49 (1.01)

Overall average*      49.30 (1.74)    45.65 (1.94)    22.61 (1.55)

* Standard deviations in the "average" rows are the square root
of pooled within-cell variances.
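
The footnote's pooling rule can be checked with a short Python
sketch. It assumes the cells being pooled carry equal weight, which
the paper does not state explicitly, and it uses the age 9
multiple-choice-without-IDK entries of Exhibit 1 as input; the
function name is hypothetical.

```python
import math


def pooled_sd(cell_sds):
    """Square root of the mean of the within-cell variances,
    assuming equally weighted cells."""
    return math.sqrt(sum(sd ** 2 for sd in cell_sds) / len(cell_sds))


# Age 9, multiple choice without IDK: exercises 1, 2 and 3.
means = [76.9, 27.5, 65.2]
sds = [1.80, 2.07, 1.61]

average_mean = sum(means) / len(means)
average_sd = pooled_sd(sds)

print(round(average_mean, 2), round(average_sd, 2))
```

The result comes out at about 56.5 with a pooled standard deviation
of about 1.84, close to the tabled 56.51 (1.83); the small
differences presumably reflect rounding of the published cell
entries.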







The overall average (the last line in the table) shows a large
difference between open-ended and multiple-choice formats and a
smaller, but still statistically significant, difference between
the two multiple-choice formats. Having the "I don't know" foil
does reduce the overall percent correct. Having the "IDK" may
slightly increase the variance, but this experiment was not
sensitive enough to detect it.



Importance of the Study.

This is one of a series of studies to locate sources of
non-sampling errors in the estimates of performance on assessment tasks.
The goal is to increase the accuracy of baseline and change estimates,
since great precision is required to detect change.



This study has resulted in several further investigations.
Because of the stability of the small block (package) effect, NAEP
staff is now looking at methods of adjusting sampling weights to
make packages more comparable. Because of some inconsistency in
performance of IDK vs no-IDK multiple-choice exercises, staff is
continuing to examine the use of item-scoring formulas (versions
of correction-for-guessing techniques) as an alternative to the
"I don't know" foil. These investigations will hopefully lead to
new techniques for increasing the accuracy of assessment results.

Attachment 4. Analysis of Variance Table for Age 9 Design

Source of Variation       Degrees of     Sum of       Mean
                          Freedom        Squares      Square

Within Blocks                 18
  Exercise: E                  2         9976.4       4988.2
  Format: F                    2         4644.0       2322.0
  Position: P                  2           27.1         13.6
      (EF²)                   (2)        (153.1)
  E x F                        2          153.1         76.5
      (EP²)                   (2)          (6.3)
  E x P                        2            6.3          3.1
      (FP)                    (2)          (0.4)
  F x P                        2            0.4          0.2
      (EFP)                   (2)          (5.5)
      (EFP²)                  (2)          (0.5)
      (EF²P)                  (2)         (12.8)
  E x F x P                    6           18.9          3.1

Between Blocks                (8)
      (EF)                    (2)        (425.8)
      (EP)                    (2)          (8.9)
      (FP²)                   (2)          (9.5)
      (EF²P²)                 (2)         (10.2)

 *α < .05
**α < .01

Attachment 5. Analysis of Variance Table for Age 13 Design

Source of Variation       Degrees of     Sum of       Mean         F
                          Freedom        Squares      Square

Within Blocks                 18
  Exercise: E                  2         4411.3       2205.7       689**
  Format: F                    2         3990.9       1995.4       624**
  Position: P                  2            7.4          3.7         1.2
      (EF²)                   (2)        (430.3)
  E x F                        2          430.3        215.2        67*
      (EP²)                   (2)         (.721)

Attachment 6. Analysis of Variance Table for Age 17 Design

Source of Variation       Degrees of
                          Freedom

Within Blocks
  Exercise: E
  Format: F
  Position: P
  E x ?
  F x P